Reinforcement learning approach to control an inverted pendulum: A general framework for educational purposes

Machine learning is often cited as a new paradigm in control theory, but it is also often viewed as empirical and less intuitive for students than classical model-based methods. This is particularly true of reinforcement learning, an approach that does not require any mathematical model to drive a system inside an unknown environment. This lack of intuition can be an obstacle to designing experiments and implementing the approach; conversely, experiments are needed to build experience and intuition. In this article, we propose a general framework for reproducing successful experiments and simulations based on the inverted pendulum, a classic problem often used as a benchmark to evaluate control strategies. Two algorithms (basic Q-learning and Deep Q-Networks (DQN)) are introduced, both in experiments and in simulation with a virtual environment, to give a comprehensive understanding of the approach and to discuss its implementation on real systems. In experiments, we show that learning over a few hours is enough to control the pendulum with high accuracy. Simulations provide insight into the effect of each physical parameter and test the feasibility and robustness of the approach.

Low-level Interface (LLI) At each major control cycle, the LLI processes the raw measurements from the encoders by smoothing them with a digital 4th-order Butterworth filter [28] and by differentiating them numerically in order to estimate ẋ and θ̇. For the communication, we use the ZeroMQ library. This makes it possible to write client controller applications that do not need to deal with the low-level management of hardware resources. In addition, clients can run either on the Raspberry Pi 4 or on any other machine able to connect to the board, e.g., via the local network or via WiFi. This opens the possibility of writing client applications in potentially any programming language supported by ZeroMQ. Our client applications are written in Python and C++.
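As an illustration of this processing step, the sketch below shows how raw encoder readings could be low-pass filtered and differentiated in Python with SciPy. The sampling rate and cutoff frequency are assumed values for the example, not the ones used by the LLI.

```python
import numpy as np
from scipy.signal import butter, lfilter, lfilter_zi

# Assumed values for illustration only: 100 Hz sampling, 5 Hz cutoff.
FS = 100.0          # sampling frequency of the encoder readings (Hz)
FC = 5.0            # filter cutoff frequency (Hz)
DT = 1.0 / FS

# 4th-order low-pass Butterworth filter coefficients.
b, a = butter(4, FC / (FS / 2.0), btype="low")

def smooth_and_differentiate(raw_positions):
    """Filter a sequence of raw encoder positions and estimate the velocity.

    raw_positions: 1-D array of cart positions x (or pendulum angles theta).
    Returns (filtered positions, numerical derivative).
    """
    zi = lfilter_zi(b, a) * raw_positions[0]       # avoid the filter start-up transient
    filtered, _ = lfilter(b, a, raw_positions, zi=zi)
    velocity = np.gradient(filtered, DT)           # finite-difference derivative
    return filtered, velocity
```

In the real-time loop, the same filter would be applied sample by sample, keeping the filter state between control cycles.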
Measurements of the physical parameters The values of the physical parameters of the cart-pole are displayed in Table 1. The pendulum mass was measured with a scale. The natural frequency ω and viscous friction coefficient k_v were inferred from the signal θ(t) of the free oscillations of the pendulum with a blocked cart, as described by Eq. (1) of the main text with ẍ = 0. We show in Fig. 1a the relaxation dynamics of the pendulum, as well as the numerical prediction of the model with the best-fitted parameters. The parameters τ, f_c, f_d and k_U in Eqs. (2-3) of the main text are inferred by imposing step functions as voltages and measuring the cart velocity as a function of time. Again, the parameters are deduced from the best fits (Fig. 1b). In Fig. 1c, where the steady-state velocity is plotted as a function of the applied voltage, we observe in more detail the effect of the three parameters f_c, f_d and k_U on the discontinuity at the velocity axis, the up-down asymmetry and the slope, respectively. The uncertainty on the angular velocity θ̇ is related to σ_θ and to the time resolution ∆t ≃ 0.05 s. This gives an uncertainty σ_θ̇ = σ_θ/∆t ≃ 52 mrad s⁻¹.
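To illustrate how ω and k_v can be extracted from such free-oscillation data, here is a minimal fitting sketch assuming a damped-pendulum model θ̈ + k_v θ̇ + ω² sin θ = 0 (a simplified stand-in for Eq. (1) of the main text with ẍ = 0); the function names and initial guesses are ours, and the actual fitting procedure of the paper may differ in detail.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def simulate_free_oscillation(t, omega, k_v, theta0):
    """Integrate theta'' + k_v*theta' + omega^2*sin(theta) = 0, starting at rest at theta0."""
    def rhs(_, y):
        theta, dtheta = y
        return [dtheta, -omega**2 * np.sin(theta) - k_v * dtheta]
    sol = solve_ivp(rhs, (t[0], t[-1]), [theta0, 0.0], t_eval=t, rtol=1e-8)
    return sol.y[0]

def fit_pendulum_parameters(t_data, theta_data):
    """Least-squares fit of (omega, k_v, theta0) to the measured free oscillations theta(t)."""
    def residuals(p):
        omega, k_v, theta0 = p
        return simulate_free_oscillation(t_data, omega, k_v, theta0) - theta_data
    guess = [2.0 * np.pi, 0.1, theta_data[0]]   # rough initial guess (assumed values)
    result = least_squares(residuals, guess)
    return result.x
```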
Methodology for training RL agents All the simulations and experiments were driven by a Dell Precision 7550 laptop using its internal GPU. One simulation of 15 × 10⁵ time steps, with logging and evaluation loops, takes 10.7 minutes using the GPU (NVIDIA Quadro T2000) and 13.43 minutes using the CPU only (Intel Core i7-10875H @ 2.30 GHz).

Table 1. Measured physical parameters.

Artificial Neural Networks
Artificial Neural Networks (ANN) are an assembly of idealized biological neurons [12]. Each neuron, labelled k, possesses a state S_k and receives a signal p_k from the other neurons. This incoming information writes

p_k = Σ_j w_jk S_j ,

where w_jk measures the weight of the link between neurons j and k. In general, there is also a bias w_0k added to p_k for each neuron. The incoming signal p_k is passed through a function f that defines the new state of the neuron,

S_k = f(p_k),

where f is the activation function.
In our problem, the ANN's input layer has 5 neurons that handle the five components of the observation (sin θ, cos θ, θ̇, x, ẋ). Each of these five neurons is connected to the first hidden layer of 256 nodes, which is in turn connected to a second hidden layer of 256 nodes. For the two hidden layers, we use the Rectified Linear Unit (ReLU) activation function [12,29], f(p) = max(0, p). The network's output layer is made up of 3 neurons, which gather information from the last hidden layer. Each output neuron represents the action-value of one of the 3 possible actions for the current state. The training process updates the unknown parameters w_jk and w_0k in order to minimize the error between the output of the ANN and the estimated true value based on the real reward given by the environment.
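A minimal PyTorch sketch of such a network is shown below; the class and variable names are ours, and only the layer sizes and activation follow the description above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network: 5 observation components in, 3 action-values out."""

    def __init__(self, n_obs: int = 5, n_actions: int = 3, hidden: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_obs, hidden),     # input layer -> first hidden layer
            nn.ReLU(),
            nn.Linear(hidden, hidden),    # second hidden layer
            nn.ReLU(),
            nn.Linear(hidden, n_actions)  # one action-value per possible action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.layers(obs)

# Example: Q-values for one observation (sin(theta), cos(theta), dtheta, x, dx).
q_net = QNetwork()
obs = torch.tensor([[0.1, 0.995, 0.0, 0.0, 0.0]])
q_values = q_net(obs)          # shape (1, 3): one value per action
```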

Additional techniques
In addition, to stabilize the learning process and obtain more reliable results, DQN employs a number of additional techniques such as a replay buffer, fixed Q-targets [7] and gradient clipping, which improves the stability of learning by clipping the TD error of Eq. (2) to the interval [-1, 1]. In our configuration, the learning process takes place at the end of every episode through gradient descent applied to mini-batches of transitions (s_i, a_i, r_i, s_{i+1}) sampled from a buffer holding 50,000 (state, action, reward, next state) tuples.
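The following sketch illustrates, under assumed values for the batch size and discount factor, how such a mini-batch update from a replay buffer could look; the TD-error clipping to [-1, 1] is implemented here through the Huber loss, whose gradient is that of the squared error with the TD error clipped.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

BUFFER_SIZE = 50_000    # buffer size from the text
BATCH_SIZE = 64         # assumed value for illustration
GAMMA = 0.99            # assumed discount factor

replay_buffer = deque(maxlen=BUFFER_SIZE)   # holds (s, a, r, s_next, done) tuples

def store_transition(s, a, r, s_next, done):
    """Append one interaction with the environment to the replay buffer."""
    replay_buffer.append((s, a, r, s_next, done))

def dqn_update(q_net, target_net, optimizer):
    """One gradient-descent step on a mini-batch sampled from the replay buffer."""
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)   # shuffling breaks temporal correlations
    s, a, r, s_next, done = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                             for x in zip(*batch))

    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s_i, a_i; w)
    with torch.no_grad():                                         # TD target from the fixed target network
        target = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)

    # Huber loss: equivalent to clipping the TD error to [-1, 1] in the gradient.
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```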
Sampling mini-batches from a stored buffer in this way is the concept of experience replay. The replay buffer makes it possible to reuse each transition in several updates, and also to break the temporal correlations between transitions by shuffling the data. The weight updates of the neural network follow the steepest-gradient scheme

w_jk ← w_jk + η [ r_i + γ max_a Q(s_{i+1}, a; w⁻_jk) − Q(s_i, a_i; w_jk) ] ∂Q(s_i, a_i; w_jk)/∂w_jk ,

where η is the learning rate, w_jk refers to the local network parameters and w⁻_jk to the target network parameters. The TD target r_i + γ max_a Q(s_{i+1}, a; w⁻_jk) approximates the true Q(s_i, a_i) in Eq. (10) of the main text, and the update is proportional to the error between this approximation of the true action-value function and the current value. To avoid a target value that changes at every step, the fixed target network was introduced, decoupling the target value from the weight update; this increases the robustness and stability of the learning. Every 1000 time steps (Table 2), the target network parameters w⁻_jk are updated with the local network parameters w_jk [29].

Hyperparameters for RL

Q-learning The hyperparameters for Q-learning were set as follows. We set α = 0.01 in Eq. (10) of the main text. As for the hyperparameter ϵ (ϵ-greedy policy), it is good practice to promote exploration in the early stage of the learning process with ϵ close to 1, while a small ϵ helps to converge quickly at the end of the process. Here ϵ decreases as a function of time, ϵ = max(ϵ_min, min(1, 1 − log₁₀((n + 1)/d))), where n is the index of the current episode. The decay coefficient d and the minimum value ϵ_min are hyperparameters that can be tuned; in this work we took d = N_T/10 and ϵ_min = 0.1.
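The ϵ schedule above can be written as a small helper; the total number of episodes N_T is an assumed value here.

```python
import math

EPS_MIN = 0.1          # epsilon_min from the text
N_T = 3000             # total number of episodes (assumed value for illustration)
D = N_T / 10           # decay coefficient d = N_T / 10

def epsilon(n: int) -> float:
    """Exploration rate for episode n: eps = max(eps_min, min(1, 1 - log10((n + 1)/d)))."""
    return max(EPS_MIN, min(1.0, 1.0 - math.log10((n + 1) / D)))

# Early episodes explore almost randomly; late episodes mostly exploit.
print(epsilon(0), epsilon(300), epsilon(N_T - 1))
```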
DQN The parameters were tuned with the help of the Optuna framework [31]. We found that the most sensitive hyperparameters are the network architecture, the batch size and the exploration rate. The complete tuning focused on the following parameters (a minimal tuning sketch is given after this list):
• Buffer size: the size of the buffer of transitions (cos(θ), sin(θ), θ̇, x, ẋ) used for learning the weights of the policy.
• Batch size: the number of samples used for each gradient-descent update of the neural network. In practice it should be large enough to avoid biased experience, but not so large that it slows down learning.
• Learning rate: the extent to which the estimated action-value function is updated at each step.
• Gamma: the discount rate of future rewards, which quantifies how much more valuable the present is than the future.
• Exploration rate: the rate at which the agent explores (acts randomly) in the environment.
• Network size: the size of the dense neural network used to approximate the Q(s, a) function.
• Target update interval: the interval after which the target network is updated. In general, the larger the value, the more stable the training, but the slower the learning.
• Train frequency: the frequency at which the weights are learned from experience; in our case we train the neural network at the end of every episode, since this is the most suitable scheme for real-life robotic reinforcement learning.
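As a rough illustration of such a tuning run, the sketch below defines an Optuna objective over these parameters; the search ranges are illustrative, and train_and_evaluate_dqn is a placeholder for a project-specific training and evaluation routine, not part of Optuna.

```python
import optuna

def train_and_evaluate_dqn(**params) -> float:
    """Placeholder for the project's own training/evaluation loop (not shown here)."""
    raise NotImplementedError

def objective(trial: optuna.Trial) -> float:
    """Train a DQN agent with sampled hyperparameters and return its evaluation score."""
    params = {
        "buffer_size": trial.suggest_categorical("buffer_size", [10_000, 50_000, 100_000]),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128, 256]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "gamma": trial.suggest_float("gamma", 0.9, 0.999),
        "exploration_fraction": trial.suggest_float("exploration_fraction", 0.05, 0.5),
        "net_width": trial.suggest_categorical("net_width", [64, 128, 256]),
        "target_update_interval": trial.suggest_categorical("target_update_interval",
                                                            [500, 1000, 5000]),
    }
    return train_and_evaluate_dqn(**params)   # e.g., mean episode return over evaluation runs

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```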